07-Importing your own data and factors

Working directories

Before we start analyzing today’s data, let’s first talk about the notion of working directories.

The working directory is where R looks for files that you ask it to load, and where it will put any files that you ask it to save. You can see your current working directory at the top of the console (yours may be different from mine):

You can also see it through the getwd() function.

As you get more experienced and start handling more projects, it’s a good idea to organize your projects into directories and, when working on a project, set the working directory to the project’s directory. That way, you will know where to find your project files and you won’t mix files from different projects together.

You can change the working directory using the setwd() function. An easier way is to go to the menu bar and select Session > Set Working Directory, then choose one of the options there.
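As a minimal sketch of the two functions mentioned above, the snippet below records the current working directory, switches to another folder, and then switches back. Using tempdir() as the target is only an assumption for illustration; in practice you would pass the path to your own project folder.

```r
# remember where we started so that we can come back later
old_wd <- getwd()
print(old_wd)

# switch to another folder (R's temporary directory, purely for illustration)
setwd(tempdir())
print(getwd())

# switch back to the original working directory
setwd(old_wd)
```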

R Scripts

Why write R scripts? R scripts facilitate easy storage, running and sharing of code. For example:

If we wanted to get the results of our drought analysis for a different county or state, we already have all the R commands saved in an .R file: all we have to do is make some minor changes to select the correct county and re-run all the commands.
If we wanted to run this script every week to get a plot based on the latest information, we just need to run the script.
If a friend wanted to learn how to run our analysis, we just have to hand them the R script instead of walking them through each step.

How to write R scripts? Just type commands in the window for the file, with each command on its own line.

It can be difficult to write an R script all at once. Instead, we can use the following workflow to make sure that the script works as it should:

1. Type a line of code in the window for the file.
2. Highlight the line of code, then execute it by selecting the button > Run Selected Line(s) (or using the Cmd-Enter or Ctrl-Enter shortcut). This action copies the code to the console and runs it.
3. Check that the result that you get is what you want. If it is not, amend the code and perform step 2 again.
4. Once you are done, save the R script and exit RStudio. Open RStudio and the R script, highlight all the code and run it. If you didn’t make any mistakes with your code, it should run as you intended.

For all the code below, follow the workflow above, i.e. type it into the window for the R script, then run it in the console.

Loading the NBA dataset

Today we’ll be working with an NBA player dataset that I downloaded from Kaggle. We will be working with a refined version of this dataset. For those who are interested, you can download the raw dataset at the Kaggle link and access the script I used to process the data here.

Download the NBA dataset from the course website. Next, make sure that the working directory is the folder where the NBA dataset is located. To load the dataset into R, click on the “Import Dataset” button in the “Environment” pane, then click “From Text (readr)…”

For “File/Url”, click the “Browse” button on the right and locate the NBA dataset. Within a short period of time, the “Data Preview”, “Import Options” and “Code Preview” sections are populated:

First, look at the “Code Preview” section. This is the code that R is using in order to produce the dataset seen in the “Data Preview”. It loads the readr package, then uses the read_csv() function to read in the data file.

Next, look at the “Data Preview” section. Notice how each column has a type associated with it. How does read_csv() know what type each column is? From the documentation, read_csv() looks at the first 1000 rows in the dataset and makes a guess. It’s often correct, but sometimes it’s not.
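When the guess is not what you want, read_csv() accepts a col_types argument that overrides it. The sketch below is an illustration only: the file and its two columns (player, jersey) are made up and written to a temporary location, and are not part of the NBA dataset.

```r
library(readr)

# write a tiny made-up CSV file to a temporary location
path <- tempfile(fileext = ".csv")
writeLines(c("player,jersey", "A,23", "B,30"), path)

# let read_csv() guess: jersey looks numeric, so it is read as a number
guessed <- read_csv(path)
class(guessed$jersey)

# override the guess for one column; the other columns are still guessed
forced <- read_csv(path, col_types = cols(jersey = col_character()))
class(forced$jersey)
```

Specifying the type by hand is mostly useful for columns that look like numbers but are really labels.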

Now, if we click the “Import” button, R will execute the code in the “Code Preview” section in the console. This is usually not what we want to do since we want to keep any code we execute in a script. Do this instead:

Highlight the code (except for the last line starting with View) and copy it (either by Ctrl-C or Cmd-C, or Right click > Copy).
Click the “Cancel” button, then paste the code into our R script.
Change the name of the variable that the output of read_csv(...) is assigned to, so that it is df. Also change library(readr) to library(tidyverse): loading the tidyverse package loads readr as well as other packages that we will use today.

You should end up with the code below (with comments added). Run it to import the dataset!

# load NBA dataset
library(tidyverse)

## ── Attaching packages ───────────────────────────────────────────────────── tidyverse 1.2.1 ──

## ✔ ggplot2 3.2.1     ✔ purrr   0.3.2
## ✔ tibble  2.1.3     ✔ dplyr   0.8.3
## ✔ tidyr   0.8.3     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.4.0

## ── Conflicts ──────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

df <- read_csv("nba_tidy.csv")

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   player = col_character(),
##   team = col_character(),
##   college = col_character(),
##   birth_state = col_character()
## )

## See spec(...) for full column specifications.

Consider what the code above does when someone else opens it on their computer. R looks for the file nba_tidy.csv in the present working directory. Hence, anyone using this script must make sure this file is in the present working directory; if not, an error will occur when the script is run.
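One way to make this failure easier to diagnose is to check for the file before trying to read it. The helper below is a hypothetical sketch (check_data_file() is a name made up for this example, not a function from any package); it only uses base R's file.exists() and getwd().

```r
# hypothetical helper: returns TRUE if the file is found, otherwise prints
# a message naming the folder that R is currently looking in
check_data_file <- function(path) {
    if (file.exists(path)) {
        return(TRUE)
    }
    message("Cannot find '", path, "' in ", getwd())
    FALSE
}

# returns TRUE only if nba_tidy.csv is in the present working directory
check_data_file("nba_tidy.csv")
```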

Examining the dataset

Use the functions that you have learnt so far to examine the dataset. What does each row correspond to? How many rows and columns are there?

Each row in this dataset corresponds to a player that played at some point during the NBA 2016-2017 season. Here is a short summary of the variables in the dataset: many of them are standard statistics that are recorded for basketball games.

player: Name of player.
team: Team that player was on for the season. If the player was on more than one team, this refers to the team that the player played the most games with.
G: No. of games played.
GS: No. of games played as a starter.
MP: Total minutes played.
FG: No. of successful field goals (i.e. 2-point or 3-point shots).
FGA: No. of field goals attempted.
3P: No. of successful 3-point shots.
3PA: No. of 3-point shots attempted.
FT: No. of successful free throws.
FTA: No. of free throws attempted.
ORB: No. of offensive rebounds.
DRB: No. of defensive rebounds.
AST: No. of assists.
STL: No. of steals.
BLK: No. of blocks.
TOV: No. of turnovers.
PF: No. of personal fouls.
PTS: Total points scored.
height: Height of player in centimeters.
weight: Weight of player in kilograms.
college: Where the player went to college.
birth_year: Year the player was born.
birth_state: State where the player was born.

Let’s add two more columns to this dataset: field goal percentage (i.e. percentage of field goals which were successful), and the age of the player in 2019.

df$FGpct <- df$FG / df$FGA * 100
df$age <- 2019 - df$birth_year
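One thing to watch for with a ratio like FGpct: if a player has zero field goal attempts, the division is 0 / 0, which R evaluates to NaN. The sketch below uses made-up numbers (not taken from the dataset) to show this.

```r
# made-up values: the second player has no attempts at all
fg  <- c(5, 0)
fga <- c(10, 0)

fg_pct <- fg / fga * 100
fg_pct          # 50 NaN
is.na(fg_pct)   # FALSE TRUE: NaN is treated as a missing value
```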

Saving and reading

At this point, our data frame df contains slightly different data. This may be something that we want to save to our local drive, so that in the future we can use this file directly, instead of loading the original one and making the changes.

We have 2 options for doing that. The first is to save it as a .csv file with readr’s write_csv function. Type the following in the console:

write_csv(df, "nba_tidy2.csv")

This saves the value of df to the file nba_tidy2.csv.

The second option is to save it into an .rds file. Type the following in the console:

saveRDS(df, "nba_tidy2.rds")

(The saved file should appear in your working directory.) To read from an .rds file, use the readRDS function. The code below loads whatever is in nba_tidy2.rds and assigns it to the variable df2.

df2 <- readRDS("nba_tidy2.rds")

While write_csv only works for data frames, saveRDS works for any R object.
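A small sketch of the difference, using a made-up data frame and temporary files. Note that it uses base R's write.csv() and read.csv() as stand-ins for the readr functions so that it runs without extra packages; that substitution is an assumption of this example, not what the lesson code uses.

```r
# toy data frame with a factor column (made-up values)
toy <- data.frame(team = factor(c("LAL", "GSW", "LAL")),
                  pts  = c(10, 20, 30))

rds_path <- tempfile(fileext = ".rds")
csv_path <- tempfile(fileext = ".csv")

saveRDS(toy, rds_path)
write.csv(toy, csv_path, row.names = FALSE)

from_rds <- readRDS(rds_path)
from_csv <- read.csv(csv_path, stringsAsFactors = FALSE)

class(from_rds$team)   # "factor": the column type is preserved
class(from_csv$team)   # "character": a CSV file only stores text

# saveRDS() also handles objects that are not data frames, such as lists
saveRDS(list(a = 1:3, b = "hello"), rds_path)
str(readRDS(rds_path))
```

The trade-off is that a .csv file can be opened by almost any program, while an .rds file is meant to be read back into R.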

Changing factor levels with `fct_recode()`

Let’s look at just the 4 teams in California for now:

# look at the teams in California
ca_df <- df %>%
    filter(team %in% c("GSW", "LAC", "LAL", "SAC"))

Which team made the most field goals? We can answer this question with some dplyr work:

# total FG by team
ca_df %>%
    group_by(team) %>%
    summarize(tot_FG = sum(FG))

## # A tibble: 4 x 2
##   team  tot_FG
##   <chr>  <dbl>
## 1 GSW     3489
## 2 LAC     3242
## 3 LAL     3231
## 4 SAC     3066

For someone who doesn’t know basketball well, the team names as acronyms may not make sense. We can use fct_recode() to replace the acronyms with the full name so that the output is more interpretable:

# relabel the team names
ca_df <- ca_df %>% mutate(team = fct_recode(team,
    "Golden State Warriors" = "GSW",
    "Los Angeles Clippers" = "LAC",
    "Los Angeles Lakers" = "LAL",
    "Sacramento Kings" = "SAC"))

# total FG by team
ca_df %>%
    group_by(team) %>%
    summarize(tot_FG = sum(FG))

## # A tibble: 4 x 2
##   team                  tot_FG
##   <fct>                  <dbl>
## 1 Golden State Warriors   3489
## 2 Los Angeles Clippers    3242
## 3 Los Angeles Lakers      3231
## 4 Sacramento Kings        3066

In fct_recode(), the new level names are on the left while the old level names are on the right. It’s possible to have the same level name appear more than once on the left: this causes different levels to be grouped together. Also, any old level names that don’t appear on the right remain untouched.
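The grouping behavior can be sketched on a toy factor. The observations below are made up, and the merged level name "LA" is chosen only for illustration.

```r
library(forcats)

# toy factor with made-up observations
team <- factor(c("GSW", "LAC", "LAL", "SAC", "LAL"))

# "LA" appears twice on the left, so LAC and LAL are merged into one level;
# GSW and SAC are not mentioned, so they stay as they are
merged <- fct_recode(team, "LA" = "LAC", "LA" = "LAL")

table(merged)
```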

Collapsing factor levels with `fct_collapse()`

There are a total of 30 NBA teams, and they are grouped into 6 divisions (roughly by geography), each with 5 teams. We can add this information to the data frame using fct_collapse():

# create division column
df <- df %>% mutate(division = fct_collapse(team,
    Atlantic  = c("BOS", "BRK", "NYK", "PHI", "TOR"),
    Central   = c("CHI", "CLE", "DET", "IND", "MIL"),
    Southeast = c("ATL", "CHO", "MIA", "ORL", "WAS"),
    Northwest = c("DEN", "MIN", "OKC", "POR", "UTA"),
    Pacific   = c("GSW", "LAC", "LAL", "PHO", "SAC"),
    Southwest = c("DAL", "HOU", "MEM", "NOP", "SAS")))

This allows us to answer questions at the division level, e.g. how many points did players in each division score?

# most points by division
df %>% group_by(division) %>%
    summarize(tot_pts = sum(PTS)) %>%
    arrange(desc(tot_pts))

## # A tibble: 6 x 2
##   division  tot_pts
##   <fct>       <dbl>
## 1 Pacific     44298
## 2 Atlantic    43599
## 3 Southwest   43470
## 4 Northwest   43449
## 5 Central     42857
## 6 Southeast   42080
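To see what fct_collapse() does on a smaller scale, here is a sketch on a toy factor with made-up observations. Only two divisions are listed, to illustrate that levels which are not mentioned (here CHI) are left unchanged.

```r
library(forcats)

# toy factor with made-up observations
team <- factor(c("BOS", "NYK", "GSW", "LAL", "CHI", "BOS"))

# each new level on the left absorbs the vector of old levels on the right
division <- fct_collapse(team,
    Atlantic = c("BOS", "NYK"),
    Pacific  = c("GSW", "LAL"))

table(division)
```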

Lumping infrequent categories together with `fct_lump()`

Which college produced the most NBA players? Again, this can be answered using dplyr functions (we should filter out the players who have NA for college):

# no. of players by college (excluding NAs)
df %>% filter(!is.na(college)) %>% 
    group_by(college) %>%
    summarize(count = n()) %>%
    arrange(desc(count))

## # A tibble: 108 x 2
##    college                               count
##    <chr>                                 <int>
##  1 University of Kentucky                   24
##  2 Duke University                          18
##  3 University of Kansas                     14
##  4 Syracuse University                      12
##  5 University of California, Los Angeles    12
##  6 Louisiana State University                9
##  7 University of Arizona                     9
##  8 University of Florida                     9
##  9 Michigan State University                 8
## 10 University of North Carolina              8
## # … with 98 more rows

From the summary, we can see that there are a total of 108 colleges represented. If we wanted to see just the top 10, we could add head() to the pipe, but then we don’t know how many other players there were for other colleges.

# no. of players by college (excluding NAs)
df %>% filter(!is.na(college)) %>% 
    group_by(college) %>%
    summarize(count = n()) %>%
    arrange(desc(count)) %>%
    head(n = 10)

## # A tibble: 10 x 2
##    college                               count
##    <chr>                                 <int>
##  1 University of Kentucky                   24
##  2 Duke University                          18
##  3 University of Kansas                     14
##  4 Syracuse University                      12
##  5 University of California, Los Angeles    12
##  6 Louisiana State University                9
##  7 University of Arizona                     9
##  8 University of Florida                     9
##  9 Michigan State University                 8
## 10 University of North Carolina              8

Instead, we could change the college variable using fct_lump(), which lumps the least common factor levels together into an “Other” category. By specifying n = 10, we tell fct_lump() to keep the most common n = 10 values.

# no. of players by college (excluding NAs)
df %>% filter(!is.na(college)) %>% 
    mutate(college = fct_lump(college, n = 10)) %>%
    group_by(college) %>%
    summarize(count = n()) %>%
    arrange(desc(count))

## # A tibble: 11 x 2
##    college                               count
##    <fct>                                 <int>
##  1 Other                                   225
##  2 University of Kentucky                   24
##  3 Duke University                          18
##  4 University of Kansas                     14
##  5 Syracuse University                      12
##  6 University of California, Los Angeles    12
##  7 Louisiana State University                9
##  8 University of Arizona                     9
##  9 University of Florida                     9
## 10 Michigan State University                 8
## 11 University of North Carolina              8

This summary tells us that most players with a recorded college (225 of 348, or roughly 65%) don’t come from one of the top 10 colleges represented.
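The same idea on a toy factor, as a minimal sketch with made-up values: keeping the 2 most common levels and lumping the rest.

```r
library(forcats)

# made-up observations: A and B are common, C, D and E appear once each
college <- factor(c(rep("A", 5), rep("B", 4), "C", "D", "E"))

# keep the 2 most common levels, lump everything else into "Other"
lumped <- fct_lump(college, n = 2)

table(lumped)
```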

Ordering a bar plot using `fct_infreq()` and `fct_rev()`

How many players were there in each division? We can answer this question with a bar plot:

ggplot(df) +
    geom_bar(aes(x = division))

The bars don’t seem to be arranged in an intuitive order. fct_infreq() allows us to arrange them by frequency:

ggplot(df) +
    geom_bar(aes(x = fct_infreq(division)))

This orders the bars from tallest to shortest. If we want to order them from shortest to tallest, we can invert the factor ordering using fct_rev():

ggplot(df) +
    geom_bar(aes(x = fct_rev(fct_infreq(division))))

The code can be written more elegantly using pipe notation:

ggplot(df) +
    geom_bar(aes(x = division %>% fct_infreq() %>% fct_rev()))
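Both functions only change the order of the factor levels, which is what ggplot2 uses to place the bars. A small sketch on a toy factor (made-up values) makes the level orders visible without drawing a plot:

```r
library(forcats)

# made-up observations: "c" appears 3 times, "a" twice, "b" once
x <- factor(c("b", "a", "c", "c", "c", "a"))

levels(x)                        # "a" "b" "c" (default: alphabetical)
levels(fct_infreq(x))            # "c" "a" "b" (most frequent first)
levels(fct_rev(fct_infreq(x)))   # "b" "a" "c" (least frequent first)
```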

Ordering other plots with `fct_reorder()`

Which team attempted the most free throws? What was the distribution of free throw attempts like? We can make a plot of number of free throws attempted by team. Notice that I have swapped the x and y axes here to make the plot easier to read.

# most freethrows (unordered)
df %>% group_by(team) %>%
    summarize(total_FTA = sum(FTA)) %>%
    ggplot() +
    geom_point(aes(x = total_FTA, y = team))

The teams are ordered alphabetically, with ATL at the bottom and WAS on top. This makes it easy to locate a specific team of interest, but it makes it difficult to tell where each team is in relation to the others. We can use fct_reorder() to order the teams based on their total free throws attempted values:

# most freethrows (ordered)
df %>% group_by(team) %>%
    summarize(total_FTA = sum(FTA)) %>%
    ggplot() +
    geom_point(aes(y = fct_reorder(team, total_FTA), x = total_FTA))

From this, it is clear that PHO had the most free throw attempts while DAL had the fewest. We can also see a clear break between DAL and DET and the rest of the teams.

Which is the oldest team? The data visualization below gives a boxplot of age for each team (we remove NAs first and flip axes for readability):

# age of players by team (unordered)
df %>%
    filter(!is.na(age)) %>%
    ggplot() +
    geom_boxplot(aes(x = team, y = age)) +
    coord_flip()

Again, the teams are ordered alphabetically. If we replace x = team with x = fct_reorder(team, age), then the team variable will be ordered by the age values. By default, it will order them by the median of the age values: we can see this by comparing the lines in the middle of the boxplots.

# age of players by team (ordered)
df %>%
    filter(!is.na(age)) %>%
    ggplot() +
    geom_boxplot(aes(x = fct_reorder(team, age), y = age)) +
    coord_flip()

Below, we order the teams by the maximum age on each team instead.

# age of players by team (ordered)
df %>%
    filter(!is.na(age)) %>%
    ggplot() +
    geom_boxplot(aes(x = fct_reorder(team, age, max), y = age)) +
    coord_flip()
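The effect of the third argument can be checked on a toy example. The two teams and their ages below are made up; the point is only that the summary function decides the order of the levels.

```r
library(forcats)

# made-up data: team X has one much older player
team <- factor(c("X", "X", "X", "Y", "Y", "Y"))
age  <- c(20, 21, 40, 25, 26, 27)

# default: order by median age (X = 21, Y = 26)
levels(fct_reorder(team, age))        # "X" "Y"

# order by maximum age instead (X = 40, Y = 27)
levels(fct_reorder(team, age, max))   # "Y" "X"
```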

We can use fct_reorder() for bar plots as well if we are using geom_col(), not geom_bar(). For example, say we want a visualization of the top 10 point scorers for the season, with the bar color depicting the player’s field goal percentage. This is an initial attempt without ordering the players:

# top 10 scorers (unordered)
df %>% top_n(n = 10, wt = PTS) %>%
    ggplot() +
    geom_col(aes(x = player, y = PTS, fill = FGpct)) +
    coord_flip()

We may try to arrange the dataset in the order we want before passing it to ggplot(), but that doesn’t work: the players will still be ordered in their default order, i.e. alphabetically.

# top 10 scorers (ordered: doesn't work!)
df %>% top_n(n = 10, wt = PTS) %>%
    arrange(desc(PTS)) %>%
    ggplot() +
    geom_col(aes(x = player, y = PTS, fill = FGpct)) +
    coord_flip()

We can use fct_reorder() to order the players:

# top 10 scorers (ordered)
df %>% top_n(n = 10, wt = PTS) %>%
    arrange(desc(PTS)) %>%
    ggplot() +
    geom_col(aes(x = fct_reorder(player, PTS), y = PTS, fill = FGpct)) +
    coord_flip()
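A rough way to see why the row order has no effect while fct_reorder() does: the order on the axis comes from the factor levels, not from the order of the rows. The sketch below uses made-up names and points, with base R's factor() as a stand-in for what happens to a character column.

```r
library(forcats)

# made-up players, with rows already sorted by descending points
player <- c("Amy", "Moe", "Zed")
pts    <- c(30, 20, 10)

# converting to a factor ignores the row order: levels are alphabetical
levels(factor(player))                     # "Amy" "Moe" "Zed"

# fct_reorder() sets the levels according to pts (smallest first)
levels(fct_reorder(factor(player), pts))   # "Zed" "Moe" "Amy"
```

After coord_flip(), the first level is drawn at the bottom, so the player with the most points ends up at the top of the plot.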

Once the basic plot is done, we can add some bells and whistles to make the plot more informative and appealing:

# top 10 scorers (ordered: nicer)
df %>% top_n(n = 10, wt = PTS) %>%
    arrange(desc(PTS)) %>%
    ggplot() +
    geom_col(aes(x = fct_reorder(player, PTS), y = PTS, fill = FGpct)) +
    scale_fill_gradient(low = "orange", high = "blue") +
    coord_flip() +
    labs(title = "Top 10 players with most points",
         x = NULL, y = "Points")

Optional material

`ggplot2` and `dplyr` practice

Below are some exercises for you to practice your ggplot2 and dplyr skills. All of them use the df dataset as the starting point. (It doesn’t matter if the division column has been created or not; it will not affect the results below.) Solutions are in the section below.

Plot a histogram of minutes played (MP).
Make a scatterplot of FGpct vs. FGA. Set alpha to 0.5.
Make the same plot as above, but only include players with at least 200 field goal attempts (i.e. FGA >= 200). Add a geom_smooth() layer to the plot.
Who are the top ten players by FGA? Give just the players’ names, FGA and FGpct values.

`ggplot2` and `dplyr` practice (solutions)

Plot a histogram of minutes played (MP).

ggplot(df) +
    geom_histogram(aes(x = MP))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Make a scatterplot of FGpct vs. FGA. Set alpha to 0.5.

ggplot(df) +
    geom_point(aes(x = FGA, y = FGpct), alpha = 0.5)

## Warning: Removed 1 rows containing missing values (geom_point).

Make the same plot as above, but only include players with at least 200 field goal attempts (i.e. FGA >= 200). Add a geom_smooth() layer to the plot.

df %>% filter(FGA >= 200) %>%
    ggplot(aes(x = FGA, y = FGpct)) +
        geom_point(alpha = 0.5) +
        geom_smooth()

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Who are the top ten players by FGA? Give just the players’ names, FGA and FGpct values.

df %>% arrange(desc(FGA)) %>%
    head(n = 10) %>%
    select(player, FGA, FGpct)

## # A tibble: 10 x 3
##    player               FGA FGpct
##    <chr>              <dbl> <dbl>
##  1 Russell Westbrook   1941  42.5
##  2 Andrew Wiggins      1570  45.2
##  3 DeMar DeRozan       1545  46.7
##  4 James Harden        1533  44.0
##  5 Anthony Davis       1527  50.4
##  6 Damian Lillard      1488  44.4
##  7 Karl-Anthony Towns  1479  54.2
##  8 Isaiah Thomas       1473  46.3
##  9 Kemba Walker        1449  44.4
## 10 Stephen Curry       1443  46.8

Session info

sessionInfo()

## R version 3.6.1 (2019-07-05)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Mojave 10.14.5
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] forcats_0.4.0   stringr_1.4.0   dplyr_0.8.3     purrr_0.3.2    
## [5] readr_1.3.1     tidyr_0.8.3     tibble_2.1.3    ggplot2_3.2.1  
## [9] tidyverse_1.2.1
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.2       cellranger_1.1.0 pillar_1.4.2     compiler_3.6.1  
##  [5] tools_3.6.1      zeallot_0.1.0    digest_0.6.20    lubridate_1.7.4 
##  [9] jsonlite_1.6     evaluate_0.14    nlme_3.1-140     gtable_0.3.0    
## [13] lattice_0.20-38  pkgconfig_2.0.2  rlang_0.4.0      cli_1.1.0       
## [17] rstudioapi_0.10  yaml_2.2.0       haven_2.1.1      xfun_0.9        
## [21] withr_2.1.2      xml2_1.2.2       httr_1.4.1       knitr_1.24      
## [25] vctrs_0.2.0      generics_0.0.2   hms_0.5.1        grid_3.6.1      
## [29] tidyselect_0.2.5 glue_1.3.1       R6_2.4.0         fansi_0.4.0     
## [33] readxl_1.3.1     rmarkdown_1.15   modelr_0.1.5     magrittr_1.5    
## [37] ellipsis_0.3.0   backports_1.1.4  scales_1.0.0     htmltools_0.3.6 
## [41] rvest_0.3.4      assertthat_0.2.1 colorspace_1.4-1 labeling_0.3    
## [45] utf8_1.1.4       stringi_1.4.3    lazyeval_0.2.2   munsell_0.5.0   
## [49] broom_0.5.2      crayon_1.3.4

Kenneth Tay

Oct 15, 2019
